part 1
Two Causally Related Needles in a Video Haystack
Li, Miaoyu, Chao, Qin, Li, Boyang
Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex type, causal two-needle questions, requires extracting information about both the cause and effect events from a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbally describing a visual detail from that clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal two-needle questions, and that model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs. The dataset is available at: https://huggingface.co/datasets/causal2needles/Causal2Needles
- Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
- Asia > Singapore (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media > Film (0.46)
- Education (0.46)
- Information Technology (0.46)
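As an aside on access, here is a minimal sketch of loading Causal2Needles with the Hugging Face `datasets` library, using the repository id from the abstract; the split name and example fields below are assumptions for illustration, not documented above.

```python
# Hypothetical loader for the Causal2Needles benchmark. The repository id
# comes from the dataset URL in the abstract; the split name and any
# column names are assumptions for illustration.
from datasets import load_dataset

ds = load_dataset("causal2needles/Causal2Needles", split="test")  # split assumed
print(ds[0].keys())  # inspect the fields of one example
```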
Provably safe and human-like car-following behaviors: Part 2. A parsimonious multi-phase model with projected braking
Ensuring safe and human-like trajectory planning for automated vehicles amidst real-world uncertainties remains a critical challenge. While existing car-following models often struggle to consistently provide rigorous safety proofs alongside human-like acceleration and deceleration patterns, we introduce a novel multi-phase projection-based car-following model. This model is designed to balance safety and performance by incorporating bounded acceleration and deceleration rates while emulating key human driving principles. Building upon a foundation of fundamental driving principles and a multi-phase dynamical systems analysis (detailed in Part 1 of this study \citep{jin2025WA20-02_Part1}), we first highlight the limitations of extending standard models like Newell's with simple bounded deceleration. Inspired by human drivers' anticipatory behavior, we mathematically define and analyze projected braking profiles for both leader and follower vehicles, establishing safety criteria and new phase definitions based on the projected braking lead-vehicle problem. The proposed parsimonious model combines an extended Newell's model for nominal driving with a new control law for scenarios requiring projected braking. Using speed-spacing phase plane analysis, we provide rigorous mathematical proofs of the model's adherence to defined safe and human-like driving principles, including collision-free operation, bounded deceleration, and acceptable safe stopping distance, under reasonable initial conditions. Numerical simulations validate the model's superior performance in achieving both safety and human-like braking profiles for the stationary lead-vehicle problem. Finally, we discuss the model's implications and future research directions.
- North America > United States > California > Orange County > Irvine (0.14)
- Oceania > Australia > Queensland (0.04)
- Automobiles & Trucks (1.00)
- Transportation > Ground > Road (0.93)
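To make the nominal driving rule and the safety check concrete, here is a hedged sketch of Newell's car-following speed and a worst-case projected-braking test; the symbols (tau, d, the braking rates) and the check itself are textbook simplifications of the ideas in the abstract, not the paper's actual multi-phase control law.

```python
# Illustrative sketch: Newell's nominal car-following speed plus a
# worst-case projected-braking safety check. This is a textbook
# simplification, not the paper's multi-phase model.

def newell_speed(x_leader_lagged, x_follower, v_free, tau, d):
    """Nominal Newell speed: track the leader's trajectory shifted by the
    time gap tau and the jam spacing d, capped at the free-flow speed."""
    return min(v_free, (x_leader_lagged - d - x_follower) / tau)

def projected_braking_safe(x_f, v_f, x_l, v_l, b_f, b_l, d):
    """Project both vehicles braking to a stop at their (bounded) maximum
    rates and require the follower to stop at least d behind the leader."""
    stop_follower = x_f + v_f**2 / (2.0 * b_f)
    stop_leader = x_l + v_l**2 / (2.0 * b_l)
    return stop_follower + d <= stop_leader
```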
Using Large Language Models for Automated Grading of Student Writing about Science
Impey, Chris, Wenger, Matthew, Garuda, Nikhil, Golchin, Shahriar, Stamer, Sarah
Assessing writing in large classes for formal or informal learners presents a significant challenge. Consequently, most large classes, particularly in science, rely on objective assessment tools such as multiple-choice quizzes, which have a single correct answer. The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy. The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar. The data comprised answers from 120 students to 12 questions across the three courses. GPT-4 was provided with total grades, model answers, and rubrics from an instructor for all three courses. In addition to evaluating how reliably the LLM reproduced instructor grades, the LLM was also tasked with generating its own rubrics. Overall, the LLM was more reliable than peer grading, both in aggregate and by individual student, and approximately matched instructor grades for all three online courses. The implication is that LLMs may soon be used for automated, reliable, and scalable grading of student science writing.
- North America > United States > Arizona > Pima County > Tucson (0.14)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Instructional Material > Online (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Research Report > Experimental Study > Negative Result (0.46)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (1.00)
- Education > Educational Setting > Online (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
- Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
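As a concrete picture of the grading setup, here is a hedged sketch of rubric-conditioned grading with the OpenAI API; the prompt wording, the 10-point scale, and the function name are our assumptions, not the study's protocol.

```python
# A hedged sketch of rubric-based grading with GPT-4, in the spirit of the
# study above. The prompt wording and the scoring scale are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(question, model_answer, rubric, student_answer, max_points=10):
    prompt = (
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        f"Rubric: {rubric}\n"
        f"Student answer: {student_answer}\n"
        f"Grade the student answer out of {max_points} points. "
        f"Reply with the numeric score only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```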
Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond
Ke, Yekun, Liang, Yingyu, Shi, Zhenmei, Song, Zhao, Yang, Chiwun
Transformer-based models have been widely applied to time series forecasting (TSF) tasks. However, many of these models fail to beat a simple linear residual model, and the theoretical understanding of this failure remains limited. In this work, we propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to Asymmetric Learning in the training of attention networks: when the sign of the previous step is inconsistent with the sign of the current step in next-step-prediction time series, attention fails to learn the residual features. This makes it difficult for transformers to generalize to out-of-distribution (OOD) data with the same representation pattern, especially sign-inconsistent next-step-prediction data, which a linear residual network handles easily. We hope our theoretical insights provide necessary conditions for practitioners designing expressive and efficient transformer-based architectures.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- Asia > Vietnam (0.04)
- Asia > China > Hong Kong (0.04)
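For reference, here is a minimal PyTorch version of the linear residual forecaster the abstract reports transformers fail to beat; the class name and tensor shapes are ours, and this is one common form of such a model, not necessarily the exact baseline used.

```python
# Minimal linear residual forecaster: the next step is predicted as the
# last observation plus a learned linear correction over the context.
import torch
import torch.nn as nn

class LinearResidual(nn.Module):
    def __init__(self, context_len: int):
        super().__init__()
        self.linear = nn.Linear(context_len, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, context_len) -> (batch, 1)
        return x[:, -1:] + self.linear(x)
```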
Concentration of Cumulative Reward in Markov Decision Processes
Sayedana, Borna, Caines, Peter E., Mahajan, Aditya
In this paper, we investigate the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both infinite-horizon settings (i.e., average and discounted reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of the iterated logarithm, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of the iterated logarithm. Additionally, we explore two key implications of our results. First, we analyze the sample path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a novel martingale decomposition of cumulative rewards, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- (5 more...)
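For intuition, here is the textbook Azuma-Hoeffding bound for a bounded martingale difference sequence, the kind of inequality the abstract's non-asymptotic results build on; this is the standard form, not the paper's statement.

```latex
% Azuma-Hoeffding: if (D_t) is a martingale difference sequence with
% |D_t| <= c almost surely, then for every epsilon > 0,
\mathbb{P}\!\left( \Bigl| \sum_{t=1}^{T} D_t \Bigr| \ge \varepsilon \right)
  \le 2 \exp\!\left( - \frac{\varepsilon^2}{2 T c^2} \right).
```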
Capturing Sparks of Abstraction for the ARC Challenge
Excellent progress has been made recently in solving ARC Challenge problems. However, it seems that new techniques may be required to push beyond 60% accuracy. Even commercial Large Language Models (LLMs) struggle to 'understand' many of the problems (when given the input and output grids), which makes discovering solutions by LLM-led program search somewhat futile. In this work, LLM 'understanding' is attempted from a stronger starting position: an LLM is given complete solutions to tasks in code, and then asked to explain how the task is being solved at various levels of abstraction. Specifically, the LLM was given code solutions implemented in arc-dsl-llm (an LLM-legible version of Hodel's arc-dsl) to obtain: (a) commented code; (b) code refactored into reusable functional chunks; (c) problem solution steps; and (d) high-level problem-solving tactics. We demonstrate that 'Sparks of Abstraction' can be extracted from the LLM output, in a form that could be used in downstream tasks with local LLMs eligible to enter the ARC Prize. Both the arc-dsl-llm DSL framework (with the re-engineered solutions) and the Gemini LLM-generated data (along with the generation code) are made open source.
- Workflow (0.68)
- Research Report (0.65)
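Here is a hedged sketch of the kind of prompt that could elicit the four outputs (a)-(d) listed in the abstract from a code solution; the exact wording is our illustration, not the paper's prompt.

```python
# Hypothetical prompt builder for eliciting multi-level explanations of an
# arc-dsl-llm solution; the wording is an assumption, not the paper's prompt.
def build_prompt(task_code: str) -> str:
    return (
        "Below is a complete arc-dsl-llm solution to an ARC task.\n\n"
        f"{task_code}\n\n"
        "Please produce:\n"
        "(a) the same code with explanatory comments;\n"
        "(b) a refactoring into reusable functional chunks;\n"
        "(c) the problem-solving steps in plain language;\n"
        "(d) the high-level tactics that make the solution work.\n"
    )
```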
Annotation Guidelines for Corpus Novelties: Part 1 -- Named Entity Recognition
Amalvy, Arthur, Labatut, Vincent
The Novelties corpus was constituted mainly to fulfill two goals: in the short term, to train and test NER methods able to handle long texts, and in the longer term, to support the development of Renard [3], a pipeline aiming at extracting character networks from literary fiction. This pipeline includes several processing steps after the NER, including coreference resolution and character unification. Character networks can be used to tackle a number of tasks, including assessing literary theories, measuring the level of historicity of a narrative, detecting roles in stories, classifying novels, identifying subplots, segmenting storylines, summarizing stories, designing recommendation systems, aligning narratives, etc. See the detailed survey of Labatut and Bost [11] for more information regarding character networks. This context drove the elaboration of the corpus, which explains why it exhibits certain differences from many similar NER corpora, such as CoNLL-2003 [17] or OntoNotes v5 [20]. We originally based Novelties on the literary corpus of Dekker et al. [6], as we describe in Section A of the appendix. Note that there are other literary NER corpora as well.
- Europe > United Kingdom > England (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York (0.04)
- (6 more...)
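To illustrate the downstream use the abstract describes, here is a toy sentence-level co-occurrence network built from NER output with spaCy and networkx; this stands in for the Renard pipeline (and skips coreference resolution and character unification), so the model choice and linking rule are our assumptions.

```python
# Toy character network: link person entities that co-occur in a sentence.
# This is a stand-in for the Renard pipeline described above, not the
# pipeline itself.
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # any NER-capable pipeline works here

def character_network(text: str) -> nx.Graph:
    graph = nx.Graph()
    for sent in nlp(text).sents:
        people = {ent.text for ent in sent.ents if ent.label_ == "PERSON"}
        for a, b in itertools.combinations(sorted(people), 2):
            graph.add_edge(a, b)  # co-occurrence in the same sentence
    return graph
```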
Revisiting Extragradient-Type Methods -- Part 1: Generalizations and Sublinear Convergence Rates
Tran-Dinh, Quoc, Nguyen-Trung, Nghia
This paper presents a comprehensive analysis of the well-known extragradient (EG) method for solving both equations and inclusions. First, we unify and generalize EG for [non]linear equations to a wider class of algorithms, encompassing various existing schemes and potentially new variants, and we analyze both sublinear "best-iterate" and "last-iterate" convergence rates for the entire class, deriving new convergence results for two well-known instances. Second, we extend this EG framework to "monotone" inclusions, introducing a new class of algorithms and its corresponding convergence results. Third, we unify and generalize Tseng's forward-backward-forward splitting (FBFS) method to a broader class of algorithms for solving [non]linear inclusions when a weak-Minty solution exists, and establish its "best-iterate" convergence rate. Fourth, to complete the picture, we investigate sublinear rates of two other common variants of EG using the analysis framework developed here: the reflected forward-backward splitting method and the golden ratio method. Finally, we conduct an extensive numerical experiment to validate our theoretical findings; several new variants of our proposed algorithms outperform existing schemes in the majority of examples.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- (4 more...)
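To ground the method under analysis, here is a minimal extragradient iteration for F(x) = 0 with a monotone operator, plus a bilinear example where plain gradient steps diverge but EG converges; the step size and test problem are our own illustration, not the paper's experiments.

```python
# Minimal extragradient (EG) step: an exploration step followed by an
# update using the operator evaluated at the extrapolated point.
import numpy as np

def extragradient(F, x0, eta=0.1, iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_half = x - eta * F(x)      # forward (exploration) step
        x = x - eta * F(x_half)      # update at the extrapolated point
    return x

# Example: the monotone rotation F(x) = A x, a classic case where plain
# gradient descent cycles or diverges while EG converges to [0, 0].
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(extragradient(lambda x: A @ x, [1.0, 1.0]))
```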